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Abstract. Due to memory limitations, iterative methods have become the method of choice for 
large scale semiconductor device simulation. However, it is well known that these methods still suffer 
from reliability problems. The linear systems which appear in numerical simulation of semiconductor 
devices are notoriously ill-conditioned. In order to produce robust algorithms for practical problems, 
careful attention must be given to many implementation issues. This paper concentrates on strategies 
for developing robust preconditioners. In addition, effective data structures and convergence check 
issues are also discussed. These algorithms are compared with a standard direct sparse matrix solver 
on a variety of problems. 
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1. Introduction. The increasing demands for more accurate semiconductor de- 
vice modeling have pushed the development of numerical methods to a new era. One 
of these new developments is the study of techniques for the solutions of very large 
ill-conditioned sparse linear systems. In the past, software developers favored the use of 
sparse direct methods. These solvers could be treated as black box modules, since a reli- 
able solution could (almost) always be obtained. However, the memory requirements of 
these direct methods makes them impractical for large scale simulations. Consequently, 
attention has shifted to the use of iterative methods for device simulation. 

During the past decade, there has been considerable interest in Krylov subspace 
acceleration coupled with various preconditioning techniques. A great deal of effort [2, 
19, 24, 25, 26, 13] has been devoted to developing iterative acceleration methods, while 
robust preconditioners have received less attention, mainly because of inherent theo- 
retical difficulties. However, a good preconditioner is necessary for a robust iterative 
algorithm. Many workers have observed that a superior preconditioner can boost per- 
formance by an order of magnitude[27], while a better acceleration technique may only 
improve the performance by 10 — 30%. In [4, 24], various new developments in iterative 
methods for device simulation have been summarized. 

Although iterative methods seem to be the only practical choice for large scale 
simulations, direct methods are still used in many situations. This is because many 
existing iterative methods may fail to converge when the matrix is very ill-conditioned 
and careful attention is not given to many implementation issues. 

The objective of this article is to investigate several issues which are important for 
the performance of iterative methods in device simulation. In particular, our emphasis 
will be on construction of an effective preconditioner. Both level based [ ] and drop 
tolerance [ ] preconditioners will be tested. A new two step preconditioner, which treats 
the electric potential terms in the Jacobian in a very accurate manner, will also be 
developed. Simple block scaling [ ], which has been suggested as a preconditioner for 
device simulation problems, will also be compared to the above methods. Some other 
issues which will also be addressed include the ordering of the unknowns [ ], choice of 
acceleration method, and the convergence check criteria. All these methods were tested 
in a commercially available drift diffusion device simulator [ ] which uses full Newton 
iteration for solution of the nonlinear algebraic equations. 

2. CHORDV and test problems. CHORD System Vjref JRF McMacken, SG 
Chamberlain, CHORD: A modular semiconductor device simulation development tool 
incorporating external network models, IEEE Trans. CAD, v8, n8, p826-836, 1989] is 
a semiconductor device simulator which uses fully coupled Newton iteration to solve a 
variety of carrier transport models. In this paper, we are concerned with the traditonal 
two- carrier, drift-diffusion equations: Poisson’s equation and the electron and hole 
current continuity equations. 
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Here, is the electrostatic potential, p, n the hole and electron concentrations, J n , J p the 
corresponding current densities, Nq,Na the ionized donor and acceptor concentrations 
and R the net recombination. Using the drift-diffusion approximation, we can write the 
electron and hole currents as 


Jn = ~qp n nVip + qD n Vn 


J p = -qpppVip - qD p Vp 


where p n ,p p and D n ,D p are the carrier mobility and diffusion coefficients. These ex- 
pressions are combined to yield a system of three equations in three unknowns (r/>,n,p). 
Our carrier mobility models are taken from Nishida and Sah[ref T. Nishida, C-T. Sah, 
A physically based mobility model for MOSFET numerical simulation, IEEE Trans ED, 
vED-34, n2, p310-320, 1987] and includes components due to lattice vibration, ionized 
impurities, surface scattering and velocity saturation. The recombination term includes 
Shockley-Read-Hall, Auger and impact ionization. We convert the diffusion coefficients 
to effective mobilities using the Einstein relation. 

The three equations are discretized across a two-dimensional domain using box 
integration [ref CS Rafferty, MR Pinto, RW Dutton, Iterative methods in device simula- 
tion, IEEE Trans. CAD, vCAD-4, n4, p462-471, 1985] applied to a non-orthogonal grid. 
Consider the grid node i and its neighbours j, j+1, j+2, etc. shown in Figure[Figure C.l 
from my thesis]. In the box integration method, we begin by constructing perpendicular 
bisectors of the grid to develop a control volume. Note that this approach restricts us 
to using grid meshes which do not contain obtuse angles. 

Next we assume that the electric field and current are constant across each face of 
the polygon as well as all physical properties. Using a simple difference form for the 
field, Poisson’s equation becomes 
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To discretize the continuity equations we need a difference form for the partial 
derivatives over time. We use the simple backward Euler expression 


dj A +1 - h 
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Thus, the electron continuity equation becomes 
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The Scharfetter-Gummel approximation for current density[ref DL Scharfetter, HK 
Gummel, Large signal analysis of a silicon Read diode oscillator, IEEE Trans. ED, 
vED-16, nl, p64-77, 1969] is given by 

Jnjk + 1 = j [^ifc+l B{ Sip jfc+i / <pT^ Tlj k +\ B(Slpj k +\ J (pf)} 

*3 

where B(x) is the Bernoulli function 


B(x) 


x 

exp x — 1 


and <j>T is the thermal voltage. Thus, the resulting discrete form of the electron conti- 
nuity equation is 
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A similar expression may be developed for the hole continuity equation. 

In CHORD, the transport model is solved using Newton iteration. Given a set 
of equations with residuals r k , k = l..n, we linearize the system about a point x k and 
solve the linear system J k Ax k = —r(x k ) where J k is the Jacobian matrix formed by 
computing partial derivatives of r k . Our solution estimate is then updated by the 
Newton step Ax and the process repeated until the system is converged. In this paper, 
we will focus on iterative methods of solving the linear system. 

Two typical semiconductor models: Metal-Oxide-Semiconductor Field-Effect Tran- 
sistor (MOSFET) and bipolar junction transistor (BJT) are used for testing purposes. 
The n-channel MOSFET device is a simplified self-aligned n-channel MOSFET with a 
2um drawn channel length and a 25nm thick gate insulator (see Figure 2.1). The source 
and drain are 0.25um abrupt junctions in a lightly doped p-type substrate (5.0e+15 cra- 
3). There is no channel implant. We assume an ideal structure with no oxide charge or 
interface charge. The MOSFET problem has 15587 unknowns. 


Fig. 2 . 1 . An n-channel MOSFET 

The second device is a bipolar junction transistor (see Figure 3.1) that is an active 
three-terminal device which can be used as an amplifier or switch. There are areas of 
applications in which the bipolar transistor is superior to MOSFET, such that, in high- 
power devices and in high-speed logics for high-performance computers. This device is 
a simple vertical npn transistor formed by two ion-implant steps and a thermal anneal, 
an n+ buried layer is used to reduce the parasitic collector resistance. The BJT problem 
has 13758 unknowns. 


4 



TEST PROBLEM 3 ??????? 

Each test problem for a given device model consists of a sequence of subrproblems 
for a gradually increasing value of source-drain voltage. Each of these subproblems 
requires many Newton steps, which in turn requires many Newton solves. Consequently, 
each test problem exercises the matrix solution algorithm over many different Newton 
iterations and values if the source-drain voltage. Consequently, the total time for the 
entire test problem is a good indication of the robustness and reliability of the matrix 
solver over a wide range of situations. 

The Newton iteration convergence criteria was ?????????. 

3. Inner Iteration Convergence Criteria. The complete solution process in 
a device simulation consists of an inner and outer iteration. The outer iteration is 
the nonlinear Newton method. At each Newton step, an ill-conditioned linear system 
is solved. If the Jacobian is solved using an iterative method, then the convergence 
tolerance must be specified. For the first few Newton steps, the nonlinear residual 
in (4.2) is quite large. Therefore, a large residual reduction in the linear iteration is 
needed for an accurate Ax. After the Newton iterations reaches a certain accuracy, 
the nonlinear residual becomes smaller. The update Ax only affects the last few digits 
of the final solution. As a result, the relative error requirement for the solution of the 
Jacobian can be lowered. 

Here a dynamic switch of the linear convergence criteria is implemented in the linear 
solver. Initially, the convergence requirement is that the intial / 2 residual be reduced 
by 10~ 4 * 6 . After the nonlinear residual is reduced below 10 -3 (compared to the initial 
nonlinear residual, the linear residual reduction switchs to 10 -3 . This dynamic switch 
can generally save 10-15% of the CPU time. 


Fig. 3 . 1 . A bipolar junction transistor 


4. Block data structure and scaling. In Newton’s method, a system of linear 
equations with the Jacobian is solved every iteration. Since device simulation involves 
a system of coupled partial differential equations, it is natural to exploit the block 
structure of the Jacobian in order to obtain a good preconditioner. Two methods 
which use this concept are the Alternate Block Factorization (ABF) [3] and the Modified 
Singular Perturbation (MSP) [28]. 

For the purposes of illustration, assume that the device model in question is a drift 
diffusion model having three coupled partial differential equations. If the Jacobian equa- 
tions are ordered so that all the electric potential equations are grouped first, followed 
by the electron conservation equations, and then the hole conservation equations, and 
the unknowns are ordered so that all the electric potentials are first, then the electron 
concentrations, and finally the hole concentrations, then the Jacobian matrix can be 
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partitioned as 


(4.1) 


J = 


J<j>4> 
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where J,j , i,j = <j),n,p denotes the block of derivatives of the equation for electric 
potential, electron conservation, or hole conservation, with respect to the electric po- 
tential, electron concentration or hole concentration. Then the ABF preconditioner 
is 



D<j>n 

D<t> p 

&n<i> 

D nn 

D np 
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D pn 
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where D tJ is the diagonal matrix of J i} . More recently, the Approximate Block Elimi- 
nation (ABE) has been proposed which uses an incomplete 3x3 block factorization of 
the original block structured Jacobian. 

In this paper, another kind of block structure of the Jacobian matrix is used. The 
unknown variables at each physical grid node are grouped together to form an n x n 
block matrix, where n is the number of grid nodes. In general, the diagonal block 
elements of this block Jacobian may have different sizes since the number of unknowns 
for each node varies. This is due to the fact that different models are used in different 
device materials and at the device contacts. The motivation for explicity considering 
the block structure are the following: 

All the nonzero blocks are regarded as dense. Consequently, we need only to specify 
the sparsity pattern for the nonzero blocks. As a result, the integer space needed 
for specifying the structure of the block sparse matrix is an order of magnitude less 
compared to the original scalar sparse matrix (assuming that the average number of 
unknowns per node is at least three). This block structure will be more attractive when 
the semiconductor model becomes more sophisticated, since typically, more detailed 
physics requires more equations per node. Our study [7] indicates that one of the basic 
operations in iterative methods, namely matrix vector multiplication, takes less time if 
a block data structure is used compared to the scalar case. It is clear that the cache 
hit ratio will be higher in the block case. 

More importantly, the new block structure can allow us to reduce the strong cou- 
pling between unknowns associated with a single grid node before the iterative process 
and other preconditioning steps begin. If we order the unknowns and equations so that 
all equations and unknowns associateds with a node are ordered consecutively, then 

J n Ji2 • • • Jin 

J 21 J 22 ■ ■ • J2n 

Jn 1 Jn 2 ' ' ' Jnn 
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Table 4.1 
Block scaling 


Test 

Methods 

Total linear 

Total linear 

(N) 

applied 

iterations 

CPU time 

MOSFET 

ABF 

1723 

1772.08 

(15583) 

Two step 

73 

465.04 

BJT 

ABF 

Not converge 


(13758) 

Two step 

117 

523.54 


where J„ are typically 1 x 1, 2 x 2,- • 7 x 7 block matrices. For drift-diffusion models, 

the maximum size of is usually four (some of the nodes may have the electric current 
as an additional unknown). At each Newton step we need to solve the equation 

(4.2) JAx = -r 

where J is the Jacobian, Ax is the vector of updates, and r is the residual vector. 
First, a block scaling is applied to equation(4.2). The scaling matrix 

21 


is multiplied on both sides of equation (4.2). The preconditioned Krylov space method 
is then applied to the scaled matrix equation 

SJAx = Sr. 

Note that this scaling step is equivalent to applying the ABF preconditioner if a per- 
mutation is applied. The difference here is that a further preconditioning process will 
be applied to the scaled system. The improvement in using a block scaling followed by 
further preconditioning is significant compared to using block scaling alone. 

Table 4.1 shows the difference in performance if a further preconditioning step is 
taken. Two Jacobian matrices were generated at intermediate Newton iterations from a 
typical simulation. The two step preconditioner uses a block scaling followed by the best 
preconditioned BI-CGSTAB method which will be described in more detail in following 
sections. Table 4.1 shows the total number of solver iterations for all Newton iterations, 
as well as the total number of unknowns N. Clearly, block scaling (ABF) alone is not 
sufficient for a robust technique. 

5. Preconditioning Issues . Preconditioned Krylov subspace methods are the 
standard iterative methods for device simulation. Much of the past research has con- 
centrated on the behavior of different acceleration methods [1, 4, 5, 15, 18, 21, 22, 24]. 
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Many authors have observed that GMRES, CGS and Bi-CGSTAB are the relatively 
robust choices among the many. Our experience agrees with previous work. In partic- 
ular, we found that Bi-CGSTAB is generally the most robust method. For example, 
Figure 5.1 shows the total CPU times for problem the MOSFET problem for various 
values of the Drain-Source voltage. Clearly, Bi-CGSTAB is the most efficient accel- 
eration method. Tests for other problems showed a similar trend. Consequently, all 
computations will be carried out using Bi-CGSTAB for the remainder of this paper. 


CPU seconds 



Drain-Source Voltage (V) 

Fig. 5 . 1 . Comparison of various acceleration method$(5.1) 

For completeness, we give the preconditioned Bi-CGSTAB algorithm below. 
(LU represents the incomplete factorization of matrix A and ~Aq is a vector.) 

r Q = b — Ax o 
Cjq = (3 = a 0 = 1 
Aq 0 = q 0 = 0 
For i=l,2,... 

0 = (r 0 , r,_i); u>i = {0/0)(u>i 

0 = 0 

= r*_ i — Oi-iAqr,.!) 

q t = (LU)~ 1 qi 
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M, = Mi 

= M r o, A<?,) 

5 = r,-_i - <2\Aq, 

s=(LU)~ 1 s 

t = As 

a, = (L- 1 t,L- 1 s)/(L- 1 t,L~ 1 t) 

Xi = Xi_i + a >qi + a,s 

if Xi accurate enough, then quit 
r, = 5 — Q{t 

End 

Although a suitable acceleration scheme may improve the performance by 10-30% 
on average, a good preconditioner may speed up convergence by an order of magnitude 
[4, 27]. In this paper, we will concentrate on the issues related to incomplete LU 
preconditioners, since these are regarded as among the most robust for semiconductor 
simulations [4, 22]. 

Before we pursue the different issues in constructing a good preconditioner, a block 
graph representation of the Jacobian is useful. If we view the grid node v, as the node 
of a graph representing the block sparse matrix and map the nonzero block element J,j 
to an edge connecting the two nodes Vi, vj , the block sparse Jacobian 

Jin 
‘ ‘ • Jin 

Jr in . 

can be represented by a graph form. During the factorization process, new nonzero 
block elements will be introduced. A notion of “fill level” can be introduced to each 
position of the block matrix. We define initially 

(5.2) level]? = { °’ lf J ' 3 t °’ 

3 ( oo, otherwise. 

The nonzero block elements then have a fill level 0. As the elimination proceeds, fill 
will be introduced. At the k th step of the elimination, the fill levels are given by 

levelj^ := min (levelj£ -1 * + level** -1 * + 1, level]* -1 *) . 

In other words, the fill level of a fill is the length of the shortest path between node u, 
and Vj through the nodes eliminated before v, and v 3 [20]. The fill level is commonly 
used to decide the sparsity pattern of an ILU factorization. 
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5.1. Ordering. The matrix ordering affects the computational efficiency of a ma- 
trix solver in many ways. For direct methods, a good ordering technique is essential in 
order to minimize the amount of fill. In parallel (or vector) processing, ordering again 
plays a crucial rule. A number of studies have examined the effect of matrix ordering 
on the quality of preconditioners for iterative methods based on an incomplete factor- 
ization [6, 10, 11, 16, 17, 23, 14]. In [10, 11, 12] evidence was presented to demonstrate 
that matrix ordering can have a profound effect on the quality of preconditioners. A 
heuristic method was developed that was shown to produce good matrix ordering. Un- 
fortunately, some of the techniques for an effective ordering developed in [10, 11, 12] 
can only be applied to scalar sparse matrices. The ordering generated by these new 
algorithms may destroy the block pattern we employed here. However, the graph based 
orderings (where we view each block of the Jacobian as a node in the graph) can handle 
the block matrix (5.1) naturally. 

The algorithm for generating a scalar ILU preconditioner can be directly extended 
to the block case. The block incomplete LU factorization can then be interpreted 
as a graph elimination process[20]. CHORDV uses a box integration method for the 
discretization of the partial differential equations. The graph representation of (5.1) 
is in fact the graph of the grid used to discretize the device. Note that most of the 
grid generation algorithms in semiconductor simulation involve some kind of refinement 
process. Consequently, the ordering of the grid nodes can be very scattered. It is well 
known that a scattered or random ordering is very effective for ILU methods [ ]. In 
addition, after the refinement the resulting grid is usually unstructured. Consequently, 
there is no obvious natural way to order the grid nodes. The Reverse Cuthill-McKee 
(RCM) ordering [20] originally was proposed as a good technique for reducing the profile 
of a sparse matrix. The basic idea of the ordering algorithm is to construct the level 
set from a starting node of the graph representing the sparse matrix. The reverse order 
of the level set will produce a small profile. We can also view the RCM ordering as 
a generalized natural ordering of an unstructured grid. For a rectangular grid, RCM 
ordering is the diagonal ordering of the mesh , if a corner node is chosen as the starting 
node. This ordering algorithm can be naturally generalized to the block Jacobian case. 

RCM ordering is known as an effective ordering [16, 17] for ILU preconditioning 
in many applications. The standard RCM ordering can produce a poor ordering (for 
an iterative solver) if the problem is anisotropic. A revised version for a weighted 
graph can alleviate this problem [8]. A comparison of an ILU method which uses the 
original ordering and the RCM ordering was carried out for Test Problem 3. WHERE 
IS THIS PROBLEM DESCRIBED???? Figure 5.2 give the total CPU time required 
to obtain the steady state solution for various values of the Drain-Source Voltage for 
Test Problem 3, using the original ordering (ORG), RCM ordering (RCM), and, for 
comparison, the time for a direct solver (SPARSPAK) [ ] is also shown. Her,e an 
ILU(O) preconditioner is used with Bi-CGSTAB acceleration. The three curves On 
average, RCM ordering reduces the total CPU time by about 30-40% compared to the 
original ordering. Consequently, in the following, RCM ordering will be used unless 
otherwise noted. 
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CPU seconds 



Drain-Source Voltage (V) 

Fig. 5.2. Comparison of direct solver (Sparspak) and iterative solver (Math), RCM ordering 
(Matb.RCM) and ORI ordering ( Matb.ORl ). ILU(O) preconditioner and Bi - CGSTAB acceleration used 
in the iterative solver. 

5.2. Sparsity pattern. The key step during the incomplete LU factorization pro- 
cess is to determine the sparsity pattern of the L and U. To find the optimal sparsity 
pattern for the ILU factorization is a much more difficult task than the solution of the 
Jacobian itself. Typically, some simple heuristics are used for determine if a fill element 
will be discarded. The common strategies are: 

1. By a drop tolerance, ILU(e). 

Let J-j be the submatrix which remains after (k — 1) steps of (incomplete) 
elimination [????]. The drop tolerance method discards fill if 

( 5 . 3 ) l4‘> I < £ V /|4‘ l 4‘ , l- 

There are many possible drop criteria which can be used [ ]. We will use 
the criteria (5.3) since it is similar to that used in [ ]. Note that since the 
Jacobian matrix is not symmetric positive definite, we do not use the diagonal 
modification suggested in [ ]. Even for symmetric positive definite problems, 
the diagonal modification usually results in a slow method [ ]. 

It is probably not a good idea to extend the drop tolerance approach to the 
block case. First, the computation of the norm of the block element is not 
an inexpensive operation. As well, the coefficients in each block of the Jacobian 
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Table 5.1 

Drop Tolerance Compared to Two Step 


Test 

(N) 

Methods 

applied 

Total linear 
iterations 

Total linear 
CPU time 

Double 

workspace 

MOSFET 

(15583) 

Two step 

73 

465.04 

1,087,806 

£ = 0.5 

741 

2426.10 

660,080 

£ = 0.01 

660 

2370.84 

729,729 

£ = 0.001 

416 

2508.34 

1,141,275 

£ = 0.0001 

363 

4071.69 

1,784,119 

BJT 

(13758) 

Two step 

117 

523.54 

987,151 

£ = 0.5 

1439 

3700.06 

534,994 

£ = 0.01 

1226 

3320.64 

572,442 

£ = 0.001 

1029 

3505.31 

772,080 

£ = 0.0001 

579 

2980.67 

1,108,358 


in device simulation often vary by many orders of magnitude (lO 10 — 10 16 ). One 
large entry in a block element may result in keeping the entire fill block, which 
may not be desirable. Consequently, the drop tolerance method will be applied 
to each individual element in the Jacobian matrix in the following. A similar 
approach was used in [ ]. 

In general, we have found that a drop tolerance incomplete factorization is not 
an effective technique for semiconductor device simulations. The characteris- 
tics of the Jacobian in this particular application cause problems for a drop 
tolerance approach. It is well known that the Jacobian in device simulation is 
extremely ill conditioned. In other words, a small perturbation in the factor- 
ization may cause a large change in preconditioner. The basic idea of the drop 
tolerance heuristic becomes questionable in this case. Many experiments have 
supported this observation [9]. It often happens that a smaller tolerance may 
result a “worse” preconditioner. 

The drop tolerance method (applied with criteria (5.3) to each element of the 
Jacobian) was compared with use of a two stage method (to be described in the 
following Section). Table 5.1 shows the total CPU time required for the matrix 
solve, the number or iterations, and the storage required (for a single value of 
the source-drain voltage), for various values of the drop tolerance e. For the 
drop tolerance method, the total CPU times do not decrease monotonically 
as the drop tolerance is decreased, which is somehwat disturbing. The drop 
tolerance preconditioner is between six and ten times slower than the Two step 
preconditioner. It is clear that, for the same amount of storage, the Two step 
method is far superior to the drop tolerance approach. 

2. By fill level, ILU{1). 
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The fill block j\^ at k th step of the factorization will be discarded, if 

level,'** > £. 

It is clear that the larger the £ the better the preconditioner. When £ = n, 
a complete factorization is obtained. However, large values of £ will be too 
expensive in terms of storage for 3-D problems. Our experiments indicate that 
the performance (in terms of total CPU time) stops improving after £ = 2 even 
for 2-D problems. 

For the level approach, only one symbolic factorization is needed for the entire 
simulation as long as the grid does not change. The sparsity patterns of the 
incomplete factorization are the same for different Newton steps or time steps. 
As a result, the numerical and symbolic factorization can be separated to make 
the factorization process more efficient. This contrasts with the drop tolerance 
incomplete factorization, since a different sparsity pattern will result when 
the Jacobian changes. Therefore, the incomplete factorization process is more 
expensive for a drop tolerance preconditioner. The level approach is not only 
easy to implement, it is even more efficient for the block Jacobian case. Indeed, 
the symbolic factorization phase in this case costs only a very small fraction 
of symbolic factorization cost if the Jacobian was not considered as a block 
matrix. 

Table 5.2 shows the total CPU times for Test Problem ???? for various levels 
of fill of the ILU. The improvement from level zero to level one is significant. 
However, the improvement in going to levels higher than one is marginal. We 
also list a detailed record of a typical simulation in Table 5.2 (for a single 
value of the source-drain voltage). We can see that the number of iterations 
is monotone decreasing as the level increase. However, the amount of fill for 
the preconditioner becomes larger as the level of fill increases. Therefore, the 
higher cost for each iteration will eventually outweigh the reduction in number 
of iterations. 


Table 5.2 

Comparison of Levels 0, 1, 2, 3 and f. 


Method 

Newton 
iter # 

Linear 
iter # 

Total CPU 
linear time 

Total CPU 
time 

Double 

workspace 

Level(O) 

13 

178 

701.33 

898.18 

830998 

Level(l) 

12 

111 

510.02 

681.79 

923823 

Level(2) 

12 

85 

485.42 

670.90 

1087806 


12 

83 

558.10 

729.52 

1264461 

Level(4) 

12 

70 

580.68 

768.87 

1437259 

Direct Method 

12 



1870.62 

2529297 
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CPU seconds 



Drain-Source Voltage (V) 

Fig. 5.3. Comparison of Level 0, 1, 2 and 3 using one-step preconditioner in Bi-CGSTAB accel- 
eration and RCM ordering. 

3. By the combination of both, ILU(t,e). 

A fill entry is dropped if 

level]** > l or | J,j| < e.. 

This is an useful heuristic for many applications [11]. However, since the drop 
tolerance approach by itself does not appear to be very useful in this applica- 
tion, we will not pursue this method further. 

5.3. Two step preconditioner. As we can see from the experiments of the pre- 
vious sections, when the accuracy of the factorization (by using a higher level fill or 
a smaller drop tolerance) improves, the number of iterations required for convergence 
decreases. However, the improvement in the number of iterations will not compensate 
for the higher cost of each iteration after a certain point. The best combination of the 
strategies thus far is to use RCM ordering plus ILU(l) or ILU(2). For 3-D problem, the 
ILU(2) approach may require too much storage. In the following, we will use an ILU(l) 
factorization unless otherwise noted. 

Let Lj and Uj be the incomplete factorization of the Jacobian matrix J. The 
techniques used in the previous sections are attempt to reduce the difference 

WJ-LjVA- 
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The basic assumption of this method is that if this difference is small, then the solution 
y of the preconditioning step 

(5.4) LjUjy = p 

will be close to the solution of 


Jx = p. 

Alternatively, we can use the following criteria for producing an effective precondi- 
tioner. Denote Mj a preconditioner of J and y the solution of 

M,y = p. 

A better preconditioner Mj means a smaller residual of 

(5.5) r = Jy — p. 

When r = 0, a perfect preconditioner is obtained. Therefore a refined preconditioner 
or a two step preconditioner can be developed as follows. 

For the purposes of illustration, assume that the variables are ordered as in (4.1). 
After we solve the equation (5.4), the residual r in (5.5) is calculated. Let be the 
diagonal block element for the potential variable in (4.1) and be the part of residual 
in (5.5) corresponding to the electric potential equations. The equation 

(5.6) Ju&y* = r <t>- 

is solved. This problem is much smaller and better conditioned. Let A y = [ Ay^, 0, 0]^, 
i.e. only the potential variables become updated. Then, the refined solution y = y — Ay 
is used as the solution for the new two step preconditioner M t such that 

M t y - p. 

It is easy to see that the new residual 

r = Jy - p 


will have 


= 0 . 

In another words, the two step preconditioner has more accurate electric potential solu- 
tion. Consider a case where the drift flow of holes and electrons dominates the diffusion 
flux. In this case, as the mesh size is reduced, the electric potential derivatives in the 
Jacobian will dominate the hole and electron concentration derivatives. Essentially, this 
is because in this situation, the electric potential is a elliptic type variable, while the 
hole and electron concentrations are hyperbolic type variables. 
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For the range of two dimensional problems we have tested so far, we have found that 
us of a direct solve of equation (5.6) is quite efficient. In other words, is factored 
once, and equation (5.6) is solved by a forward and back solve each iteration. We use 
minimum degree ordering [ ] for the initial complete factorization of J ^ (this system is 
much smaller than the original Jacobian). Of course, for larger 3-D problems, it may 
be more efficient to use an iterative method to solve equation 5.6. 

Note that the two step method is similar to the Combinative technique used in [ 
]. Tables 5.3, and 5.4 present some detailed comparisons between the one-step and two 
step preconditioners for Test Problem ?????. We can see that the total number of linear 
iterations is reduced significantly with the two step preconditioner. Fig. 5.4 lists the 
CPU time comparison for a complete test run. Of course, the new preconditioner is 
more expensive than the single step preconditioner. However, the larger reduction in 
the number of iterations compensates the extra cost in each iteration. 

In Table 5.4, we present a detailed performance comparison between many different 
techniques (Note that the CPU clock is only accurate to within 5%). It is clear that a 
careful implementation of the preconditioned Krylov space method is very important for 
performance. Although some extra memory is needed for the small system (5.6) (see the 
comparison in Table 5.3), the total memory requirement for the two step preconditioner 
is still competitive with a direct method. 

CPU seconds 



Fig. 5.4. Comparison of one-step and two-step preconditioners. 
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Table 5.3 

Comparison of one-step and two-step preconditioner in Bi-CGSTAB 


Test 

Bi-CGSTAB 

Newton 

Linear 

Total CPU 

Total CPU 

problem 

(level: step) 

iter # 

iter # 

linear time 

time 


Bi-CGSTAB(1:1) 

20 

459 

1051.62 

1360.65 

problem 1 

Bi-CGSTAB(1:2) 

20 

260 

1091.46 

1407.89 


SPARS 

20 



3039.95 


Bi-CGSTAB(1:1) 

31 



3494.40 

problem 2 

Bi-CGSTAB(1:2) 

31 



3077.40 


SPARS 

31 



4701.80 


Bi-CGSTAB(1:1) 

48 

2663 

5489.89 

6222.92 

problem 3 

Bi-CGSTAB(1:2) 

48 



5369.92 | 


SPARS 

48 



7348.77 


Bi-CGSTAB(1:1) 

50 

2833 

5704.77 

6461.42 

problem 4 

Bi-CGSTAB(1:2) 

50 

1194 

4222.25 

4985.08 


SPARS 

50 



7883.16 


Bi-CGSTAB(1:1) 

57 

3203 

6397.02 

7279.58 

problem 5 

Bi-CGSTAB(1:2) 

57 

1427 

5050.29 

5899.39 


SPARS 

57 



8809.89 


6. Conclusion. 
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Table 5.4 

A performance comparison between different implementations 


■gB 

WSSm 

Methods 

applied 

Linear 
iter # 

Total CPU 
linear time 

Total CPU 
time 

Double 

workspace 

First 

device 

(15583) 

Two step 

81 

411.70 

597.00 

970423 

Level 2 

218 

621.27 

786.00 

741208 

Level 1 

292 

702.52 

875.00 

577225 

AFB 

1723 

1697.67 
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1LU(0.5) 
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2579.00 
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ILU(O.Ol) 
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ILU(O.OOOl) 

363 

4071.69 

4226.00 

1784119 

Second 

device 
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** 



437217 
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1226 
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3465.00 
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ILU(O.OOl) 

1029 
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ILU(O.OOOl) 

579 
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1108358 
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